Abstract

Bacteria and archaea are characterized by an amazing metabolic diversity, which allows them to persist in diverse and often
extreme habitats. Apart from oxygenic photosynthesis and oxidative phosphorylation, well-studied processes from
chloroplasts and mitochondria of plants and animals, prokaryotes utilize various chemo- or lithotrophic modes, such as
anoxygenic photosynthesis, iron oxidation and reduction, sulfate reduction, and methanogenesis. Most bioenergetic
pathways have a similar general structure, with an electron transport chain composed of protein complexes acting as
electron donors and acceptors, as well as a central cytochrome complex, mobile electron carriers, and an ATP synthase.
While each pathway has been studied in considerable detail in isolation, not much is known about their relative
evolutionary relationships. Wanting to address how this metabolic diversity evolved, we mapped the distribution of nine
bioenergetic modes on a phylogenetic tree based on 16S rRNA sequences from 272 species representing the full diversity of
prokaryotic lineages. This highlights the patchy distribution of many pathways across different lineages, and suggests either
up to 26 independent origins or 17 horizontal gene transfer events. Next, we used comparative genomics and phylogenetic
analysis of all subunits of the F0F1 ATP synthase, common to most bacterial lineages regardless of their bioenergetic mode.
Our results indicate an ancient origin of this protein complex, and no clustering based on bioenergetic mode, which
suggests that no special modifications are needed for the ATP synthase to work with different electron transport chains.
Moreover, examination of the ATP synthase genetic locus indicates various gene rearrangements in the different bacterial
lineages, ancient duplications of atpI and of the beta subunit of the F0 subcomplex, as well as more recent stochastic
lineage-specific and species-specific duplications of all subunits. We discuss the implications of the overall pattern of
conservation and flexibility of the F0F1 ATP synthase genetic locus.
Introduction

Bacteria and archaea use diverse bioenergetic electron transport 
chains to generate ATP. Apart from photosynthesis and aerobic
respiration, many other bacterial and archaeal bioenergetic 
pathways have been characterized in considerable biochemical 
detail (e.g. [1,2,3,4,5,6,7,8,9,10,11,12]). However, the origins of 
the diversity of bioenergetic pathways, and their evolutionary 
relationships, have so far received relatively little attention. Did 
each pathway evolve independently or did they all evolve from a 
common ancestral metabolic mode? As in organismal evolution, it 
is likely that there were some innovations and that parts of
pre-existing pathways were co-opted to evolve into new pathways. 
Molecular evolutionary studies of shared proteins amongst 
prokaryotes, coupled to data from the geological record, indicate 
that the vast majority of extant bioenergetic pathways evolved 
within the first billion years from the origin of life on Earth [13,14]
and have since been mostly characterized by stasis [15]. 
Interestingly, when 16S rRNA phylogenetic analysis is carried
out for a variety of prokaryotes, organisms that utilize different
bioenergetic pathways do not group into clear monophyletic
groups, i.e. closely related organisms can utilize quite distinct
bioenergetic strategies [16,17]. This may be due to horizontal gene
transfer [18], and highlights the challenge of deciphering the
evolution of these pathways.

While most previous studies have focused on comparison of the
organisms that harbour the bioenergetic machinery, direct
comparisons of the proteins that compose the bioenergetic
machinery have been more limited. Most bioenergetic pathways
use an electron transport chain (ETC) to generate a proton
gradient across the membrane, and the energy released by the flow
of electrons to compensate for this gradient is then used by the
ATP synthase to generate ATP. The electron transport chains of
disparate pathways have a similar general structure, being
composed of protein complexes acting as electron donors and
acceptors, with a central cytochrome bc-type complex and mobile
electron carriers between them. Three scenarios are envisaged for
the early evolution of energetic flexibility in the bacteria and the
archaea: (i) each bioenergetic pathway evolved independently, (ii)
all bioenergetic pathways evolved from a "simpler" ancestral
metabolism, (iii) some new metabolic capabilities evolved by the
modification of pre-existing pathways. The third scenario is the
most likely, and has been highlighted through detailed analysis of
the bioenergetic protein complexes, e.g. for oxygenic and
anoxygenic photosynthesis [19,20,21].

Author Summary

Bacteria and archaea are the most primitive forms of life on
Earth, invisible to the naked eye and not extremely varied
or impressive in their appearance. Nevertheless, they are
characterized by an amazing metabolic diversity, especially
in the different processes they use to generate energy in
the form of ATP. This allows them to persist in diverse and
often extreme habitats. Wanting to address how this
metabolic diversity evolved, we mapped the distribution
of nine bioenergetic modes across all the major lineages of
bacteria and archaea. We find a patchy distribution of the
different pathways, which suggests either frequent innovations,
or gene transfer between unrelated species. We
also examined the F-type ATP synthase, a protein complex
which is central to all bioenergetic processes, and common
to most types of bacteria regardless of how they harness
energy from their environment. Our results indicate an
ancient origin for this protein complex, and suggest that
different species, without necessitating major innovation,
used their pre-existing ATP synthase and adapted it to
work with different bioenergetic pathways. We also
describe gene duplications and rearrangements of the
ATP synthase subunits in different lineages, which suggest
further flexibility and robustness in the control of ATP
synthesis.

The unprecedented availability of genomic data enables us to
address evolutionary questions relating to the events that led to the
emergence of this metabolic diversity early in the evolution of life
on Earth. Although various studies have looked at the evolution of
ATP synthases across the bacteria and the archaea (e.g.
[22,23,24]), these have mostly addressed the relative relationships
between the F-, V- and A-type ATPases, and no study has looked at
organisms spanning the full bioenergetic diversity of bacteria. We
chose to examine the F0F1 ATP synthase complex, common to
nine bioenergetic modes, and sampled a large variety of species
across all major lineages to establish their homology and
evolutionary relationships. We first asked whether the evolution
of the ATP synthase complexes in these species agrees with the
16S rRNA phylogeny, i.e. whether they cluster according to the
type of ETC, or based on taxonomic groups. This enables us to
check for horizontal gene transfer events concerning the ATP
synthase, as well as for putative specific modifications in the ATP
synthase subunits associated with each bioenergetic mode. We also
examined the structure of the F0F1 ATP synthase genetic locus,
and report a variety of both ancient and recent gene duplications
and rearrangements.

Results 

No monophyly of bioenergetic modes 

In this study, we focused on nine pathways, most of which have
been well characterized at the biochemical level, and for which
enough sequence information is available to enable assessment of
the diversity within each group as well as inter-group relationships:

(i) Oxygenic photosynthesis (cyanobacteria, e.g. Synechococcus)

(ii) Anoxygenic photosynthesis (green sulfur bacteria, e.g.
Chlorobium; green non-sulfur bacteria, e.g. Chloroflexus;
proteobacteria, e.g. Chromatium, Rhodospirillum, Rhodopseudomonas;
heliobacteria, e.g. Heliobacterium)

(iii) Methanogenesis (methanogenic archaea, e.g. Methanosarcina,
Methanococcus)

(iv) Sulfate reduction (bacteria, e.g. Desulfovibrio, and archaea,
e.g. Archaeoglobus)

(v) Sulfur reduction (bacteria, e.g. Sulfurospirillum, and
archaea, e.g. Ignicoccus)

(vi) Sulfur oxidation (e.g. Sulfurimonas)

(vii) Iron oxidation (bacteria, e.g. Acidithiobacillus, and archaea,
e.g. Ferroplasma)

(viii) Iron reduction (e.g. Geobacter)

(ix) Aerobic respiration (heterotrophs, e.g. E. coli)

Species whose complete genomes are available were chosen to
represent all major lineages of bacteria and archaea, and all the
above bioenergetic modes. Information about the metabolism
(bioenergetic mode) of each species was collected from the species
description at the NCBI BioProject database, as well as from the
Integrated Microbial Genomes database. Full details of the 198
bacteria and 74 archaea species selected are given in Table S1,
while the number of species from each lineage and each
bioenergetic mode is shown in Table 1. As has been observed in
previous analyses [16,17,18], certain bioenergetic modes can be
shared by quite distinct taxonomic groups. Indeed, as demonstrated
by 16S rRNA phylogenetic analysis of the organisms
examined here (Figure 1), species which utilize the same bioenergetic
modes do not always segregate in monophyletic groups.

Inferring the origin of each bioenergetic mode is therefore
confounded by their patchy distribution among the prokaryotes.
Oxygenic photosynthesis is the only bioenergetic mode which is
unique to a lineage (the cyanobacteria). Oxidative phosphorylation
(respiration) is shared by the greatest variety of lineages, and as
such, can be considered as an ancient mode of generating energy
in both the bacteria and the archaea, while methanogenesis is
found in seven lineages within the euryarchaea, and as such can be
considered ancient to this group. However, anoxygenic photosynthesis,
sulfur reduction, sulfate reduction, sulfur oxidation, iron
reduction and iron oxidation are found in more than one lineage,
which are not closely related. The presence of the same pathway in
these distinct lineages can come about by one of three processes:
either (a) all bioenergetic modes were found in the common
ancestor of these lineages, and some have been lost from some
lineages, or (b) bioenergetic modes were acquired by distinct
lineages by horizontal gene transfer (HGT), or (c) some electron
transport chains originated multiple times independently in
different lineages. The most parsimonious explanation is probably
HGT, since, based on the phylogenetic tree of Figure 1, and as
summarized at the bottom of Table 1, the distribution of
bioenergetic pathways can be explained by up to 26 independent
origins or, alternatively, 17 horizontal gene transfer events. Four
HGT events can be inferred for iron oxidation, three HGT events
can be inferred for each of anoxygenic photosynthesis, sulfate reduction,
sulfur oxidation and iron reduction, and one HGT event can
explain the distribution of sulfur reduction (Table 1). These
inferences are based on minimal assumptions of lineage groupings
(e.g. for the alpha-, beta-, gamma- and delta-proteobacteria) as the
branching order of prokaryotic lineages is still largely unresolved
[25,26,27,28,29]; the lineage-groupings seen in a more recent and
better-resolved bacterial phylogeny [30] still do not change these
numbers. Moreover, while iron reduction and anoxygenic
photosynthesis are specific to the bacteria, the other modes
(sulfate reduction, sulfur reduction, sulfur oxidation, and iron
oxidation) are found in both bacteria and archaea. Notably,
certain lineages seem more prone to bioenergetic diversity than
others. For example, five bioenergetic modes are seen within the
gamma-proteobacteria and the firmicutes; four bioenergetic
modes are found within the alpha-proteobacteria; three bioenergetic
modes are found within the beta-, the delta- and the epsilon-proteobacteria,
the aquificae, and the sulfolobales; two bioenergetic
modes are found within the deinococci, the acidobacteria, the
actinobacteria, the thermoproteales, the desulfurococcales, and the
thermoplasmata, while sulfate reduction and iron oxidation are
both seen in the archaeoglobi. However, this may be influenced by
how many complete genomes are available per lineage, and how
well this represents the true diversity in each lineage [31]. This
picture may thus change in the future, as more diverse organisms
are sequenced.

Figure 1. Phylogenetic reconstruction based on 16S rRNA sequences to map the taxonomic distribution of bioenergetic pathways.
272 prokaryotic species are shown, whose full genome sequence is available, and which represent the full diversity of bacteria and archaea, colour-coded
based on their bioenergetic mode. Bootstrap values for highly supported nodes have been replaced by symbols, as indicated (RAxML bootstrap
values of 100% and >80%). The full species names, as well as details and accession numbers for all sequences used are given in Table S1. The tree
shown was produced by RAxML, and its topology broadly agrees with the one produced by PhyML (the analysis based on MrBayes did not
converge after 5 million generations when all sequences were included; however, when the bacteria and the archaea were examined separately,
the MrBayes analysis also agreed with the RAxML and PhyML results).
Key to bioenergetic modes: He, heterotrophs; OP, oxygenic photosynthesis; AP, anoxygenic photosynthesis; Me, methanogenesis;
SR, sulfate reduction (arsenate reduction); SfR, sulfur reduction; SO, sulfur oxidation; FR, iron reduction; FO, iron oxidation.
Lineage markers: * crenarchaeota; ** euryarchaeota.
doi:10.1371/journal.pcbi.1003821.g001
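The per-mode HGT counts quoted above can be tallied to check that they add up to the stated total. The following is a minimal sketch using only the numbers given in the text; the dictionary layout and variable names are illustrative and are not taken from Table 1 itself.

```python
# HGT events per bioenergetic mode, as quoted in the text (Table 1 summary).
hgt_events = {
    "iron oxidation": 4,
    "anoxygenic photosynthesis": 3,
    "sulfate reduction": 3,
    "sulfur oxidation": 3,
    "iron reduction": 3,
    "sulfur reduction": 1,
}

# Upper bound on independent origins quoted in the text.
independent_origins = 26

total_hgt = sum(hgt_events.values())

print(total_hgt)                        # 17
print(total_hgt < independent_origins)  # True: fewer events are needed under HGT
```

Under a simple parsimony criterion, the scenario requiring fewer events (17 transfers rather than up to 26 origins) is preferred, which is the reasoning the paragraph applies.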



Phylogenetic analysis of the ATP synthase genes 

As the ATP synthase complex is common to all the electron
transport chains of the studied bioenergetic modes, we chose to
study the evolution of this complex in the different lineages. To
examine whether the ATP synthase complex which is associated
with the different bioenergetic modes was also subject to HGT, we
performed phylogenetic analysis of all the protein subunits of the
F0F1 ATP synthase, as this is shared by most of the bacterial
lineages. However, archaea and certain bacterial species/lineages
lack ATPF0F1 altogether, and have ATPV instead: Clostridium
tetani and Thermoanaerobacter sp. X513 (clostridia), Chlamydia
trachomatis and Chlamydophila pneumoniae (chlamydiae), Deinococcus
radiodurans, Thermus scotoductus and Thermus thermophilus
(deinococci), Fibrobacter succinogenes (fibrobacteres), Borrelia
burgdorferi, Spirochaeta thermophila and Treponema
pallidum (spirochaetaceae), Aminobacterium colombiense and
Thermanaerovibrio acidaminovorans (synergistetes), Candidatus
Phytoplasma mali (mollicutes). As most subunits of the V-type and
the F-type ATPases are not homologous [24], we chose to focus
solely on the F0F1 ATP synthase.

Evolution of the F0F1 ATP Synthase and Different Bioenergetic Pathways
PLOS Computational Biology | www.ploscompbiol.org | September 2014 | Volume 10 | Issue 9 | e1003821

Figure 2. Phylogenetic reconstruction of ATPF0A. The tree shown is the best Bayesian topology, based on 215 sequences and 232 amino acid
positions (length after trimming; median sequence length before trimming: 254). Numerical values at the nodes of the tree (x/y/z) indicate statistical
support by MrBayes, PhyML and RAxML (posterior probability, bootstrap and bootstrap, respectively). Values for highly supported nodes have been
replaced by symbols, as indicated (MrBayes/PhyML/RAxML: >0.99/95%/95%; >0.95/80%/80%; >0.80/50%/50%). Species names are colour-coded
based on their bioenergetic mode as in Figure 1. Full details and accession
numbers for all protein sequences used are given in Table S1. The tree is rooted at the N-ATPase clade, previously reported to be the result of
horizontal gene transfer in a variety of species, all of which also contain a canonical ATPF0F1 (apart from the two Methanosarcina species shown,
which also have a canonical ATPV). The tree confidently separates the major bacterial taxonomic lineages, but with limited support for their
branching order: strong support is provided for a subgroup containing the verrucomicrobia and chloroflexi, while another subgroup containing the
alpha-proteobacteria, actinobacteria, chlorobi, bacteroidetes and planctomycetes also has reasonable support. This group also includes the
spirochaete Leptospira interrogans and the gemmatimonadete Gemmatimonas aurantiaca, as well as Candidatus Nitrospira defluvii which groups with
the alpha-proteobacteria. Reasonable support is also provided for the grouping of dictyoglomi and cyanobacteria, and for a subgroup containing the
fusobacteria, firmicutes, tenericutes, thermotogae, and beta-gamma-proteobacteria. Two species-specific duplications (in Saccharopolyspora
erythraea and Pelobacter carbinolicus) are highlighted with a red ">". Two further duplications are highlighted with a red "-" after the species name;
in Photobacterium profundum the duplication either occurred before the split from other closely-related species or represents HGT from other
gamma-proteobacteria; the duplication in Desulfococcus oleovorans possibly represents HGT from thermotogae (also see Figures S1 and S2).
doi:10.1371/journal.pcbi.1003821.g002

Gene sequences were identified using KEGG orthology
annotations, both by searching the KEGG orthology tables, and
by manual searches in IMG (for the species not included in
KEGG). The bacterial F0F1 ATP synthase complex is composed of
the F0 subcomplex, which is embedded in the membrane, and the
F1 subcomplex which protrudes on the side of the membrane
towards which the protons exit following the proton gradient. The
F0 subcomplex is composed of ATPF0A (K02108), ATPF0B
(K02109), and ATPF0C (K02110), while the F1 subcomplex is
composed of ATPF1A (K02111), ATPF1B (K02112), ATPF1D
(K02113), ATPF1E (K02114), and ATPF1G (K02115). The genes
encoding these subunits are usually arranged consecutively in a
conserved genetic locus, which also includes another subunit,
ATPI (K02116) and sometimes atpR. K02116 is interchangeably
associated with two pfam domains, which makes orthologous gene
assignments problematic: for consistency in the text below, ATPI
sequences containing the pfam03899-ATP_synthI domain will be
called "sI", and ATPI sequences containing the pfam09527-ATPase_gene1
domain will be called "I"; atpR sequences containing the
pfam12966-atpR domain will be called "R".
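The subunit nomenclature above can be captured in a small lookup table. The sketch below only restates the KEGG orthology numbers and pfam-based labels given in the text (subunit names written with the digits 0 and 1); the data structure and helper function names are assumptions made for illustration, not part of the original analysis.

```python
# KEGG orthology number and subcomplex for each core subunit, as listed in the text.
SUBUNITS = {
    "ATPF0A": ("K02108", "F0"),
    "ATPF0B": ("K02109", "F0"),
    "ATPF0C": ("K02110", "F0"),
    "ATPF1A": ("K02111", "F1"),
    "ATPF1B": ("K02112", "F1"),
    "ATPF1D": ("K02113", "F1"),
    "ATPF1E": ("K02114", "F1"),
    "ATPF1G": ("K02115", "F1"),
}

# Short labels used in the text for accessory genes, keyed by pfam domain.
ACCESSORY_LABELS = {
    "pfam03899": "sI",  # ATPI sequences with the ATP_synthI domain
    "pfam09527": "I",   # ATPI sequences with the ATPase_gene1 domain
    "pfam12966": "R",   # atpR
}


def subcomplex_members(subcomplex):
    """Return the sorted subunit names belonging to 'F0' or 'F1'."""
    return sorted(name for name, (_, sub) in SUBUNITS.items() if sub == subcomplex)


def accessory_label(pfam_id):
    """Return the label for an accessory gene, or None for an unlisted domain."""
    return ACCESSORY_LABELS.get(pfam_id)


print(subcomplex_members("F0"))      # ['ATPF0A', 'ATPF0B', 'ATPF0C']
print(accessory_label("pfam09527"))  # I
```

Keying the accessory labels on the pfam domain rather than on K02116 reflects the point made in the text: the KEGG number alone does not distinguish "sI" from "I".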

For each subunit, the corresponding protein sequences were
downloaded from KEGG for all species and, after multiple
alignment, phylogenetic analysis was performed using Bayesian
and maximum likelihood methods. The phylogenetic analyses for
ATPF0A and ATPF1A are shown in Figures 2 and 3, respectively,
while the rest of the trees are in Figures S1, S2, S3, S4, S5, S6, S7.
Overall, for all subunits, species segregate based on taxonomic
groups with good bootstrap support, as in the 16S tree, and not
based on bioenergetic mode. If the current patchy distribution of
bioenergetic modes (Figure 1) is due to HGT, we might expect the
ATP synthase sequences from different organisms which utilize the
same pathway to group together (as we used different colours for
the different bioenergetic modes for species names on the tree, we
would essentially expect to see organisms grouping based on
colour). This is not what we observe, suggesting that there is no
evidence of HGT of the ATP synthase despite the use of different
bioenergetic modes between closely related species.
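The expectation described above (grouping by taxonomy versus grouping by bioenergetic mode) amounts to a monophyly check on a labelled tree. The sketch below illustrates the idea on a made-up five-species tree written as nested tuples; the species names, lineage assignments and modes are hypothetical and do not come from the trees in the figures.

```python
def leaves(tree):
    """Collect leaf names from a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaves(child))
    return names


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the given leaves."""
    target = set(group)

    def visit(node):
        if set(leaves(node)) == target:
            return True
        if isinstance(node, str):
            return False
        return any(visit(child) for child in node)

    return visit(tree)


def members(labels, value):
    """Species whose label equals the given value."""
    return [species for species, label in labels.items() if label == value]


# Hypothetical toy data: topology follows lineage, modes are scattered.
toy_tree = ((("gamma1", "gamma2"), "beta1"), ("delta1", "delta2"))
lineage = {"gamma1": "gamma", "gamma2": "gamma", "beta1": "beta",
           "delta1": "delta", "delta2": "delta"}
mode = {"gamma1": "He", "gamma2": "SO", "beta1": "He",
        "delta1": "SR", "delta2": "SO"}

print(is_monophyletic(toy_tree, members(lineage, "gamma")))  # True
print(is_monophyletic(toy_tree, members(mode, "SO")))        # False
```

In this toy case the lineage labels form clades while the mode labels do not, which is the same qualitative pattern the text reports for the ATP synthase trees.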

Nevertheless, in certain species, a duplication of the whole
ATPF0F1 locus is seen (Table 2), and the majority of those
duplications correspond to the so-called N-ATPase, which appears
to have been acquired via horizontal gene transfer, as has been
reported previously [32]. The N-ATPase genetic locus is
characterized by the absence of the ATPF1D gene and the
presence of the atpR gene (Figure 4) as well as a long (>100 aa)
C-terminal extension in ATPF0B (Dataset S1). For the set of
organisms studied here, the N-ATPase is found in certain species
of planctomycetes (Rhodopirellula baltica), verrucomicrobia
(Methylacidiphilum infernorum), chlorobi (Chlorobaculum parvum,
Chlorobaculum tepidum (partial), Pelodictyon luteolum, Prosthecochloris
aestuarii), cyanobacteria (Acaryochloris marina, Cyanothece
sp. ATCC 51142, Synechococcus sp. PCC 7002), alpha-proteobacteria
(Azospirillum sp. B510, Dinoroseobacter shibae, Rhodopseudomonas
palustris, Rhodospirillum centenum/Rhodocista centenaria),
beta-proteobacteria (Rhodoferax ferrireducens), gamma-proteobacteria
(Nitrosococcus halophilus - double N-ATPase, one
locus split, both missing atpR), delta-proteobacteria (Desulfobacterium
autotrophicum, Desulfobulbus propionicus, Desulfomicrobium
baculatum, Desulfovibrio salexigens, Desulfuromonas acetoxidans,
Pelobacter carbinolicus) and methanomicrobia
(Methanosarcina acetivorans, Methanosarcina barkeri).
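The three features used above to characterize an N-ATPase locus can be expressed as simple checks. This is a rough sketch only: the gene-order strings follow the shorthand used later in the text (written with digits, e.g. 1D for ATPF1D), the function name is hypothetical, and the extension length of 120 aa is an invented example value. Because some loci listed above lack atpR, the features are reported separately instead of being combined into a strict rule.

```python
def n_atpase_features(locus_genes, f0b_cterm_extension_aa):
    """Report which of the N-ATPase hallmarks described in the text a locus shows."""
    genes = set(locus_genes)
    return {
        "lacks_1D": "1D" not in genes,
        "has_atpR": "R" in genes,
        "long_0B_extension": f0b_cterm_extension_aa > 100,
    }


# Gene orders as given in the text for the N-ATPase and the inferred ancestral locus.
n_locus = "1B-1E-I-R-0A-0C-0B-1A-1G".split("-")
canonical_locus = "I-sI-0A-0C-0B-1D-1A-1G-1B-1E".split("-")

# 120 aa is a made-up extension length used purely for illustration.
print(all(n_atpase_features(n_locus, 120).values()))        # True
print(all(n_atpase_features(canonical_locus, 0).values()))  # False
```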

The sequences corresponding to the N-ATPase form a highly
supported monophyletic group; the trees (apart from ATPF1D)
were therefore rooted at this N-ATPase clade. Phylogenetic
reconstruction of all subunits confidently separates the major
bacterial taxonomic lineages, but the trees only give limited
support for the branching order (Figures 2-3, S1, S2, S3, S4, S5,
S6). The differences between trees, with respect to the resolution of
the branching order of different lineages, are probably due to the
sequence length of the proteins analyzed; longer subunits retain
more information and tend to give better-resolved phylogenetic
trees than shorter sequences [33]. The most clear-cut grouping is
that of the beta- and gamma-proteobacteria, which is seen in all
trees, and has significant bootstrap support in all but the ATPF1D
and ATPF1E trees. Significant bootstrap support for the beta- and
gamma-proteobacteria grouping is also seen in the 16S phylogenetic
analysis (Figure 1), which also suggests groupings of the
chlorobi and the bacteroidetes, and of the fusobacteria and
tenericutes. The phylogenetic link between the chlorobi and the
bacteroidetes is also seen in the trees for ATPF0A (Figure 2),
ATPF0C (Figure S2), ATPF1A (Figure 3) and ATPF1B (Figure
S3). In the ATPF0C analysis this group also includes the
planctomycetes as well as the spirochaete Leptospira interrogans
and the gemmatimonadete Gemmatimonas aurantiaca (Leptospira
interrogans also groups with the planctomycetes in the ATPF1A
phylogeny). The ATPF0A phylogeny also has reasonable support
for grouping the chlorobi, bacteroidetes and planctomycetes,
together with the actinobacteria and the alpha-proteobacteria (this
group also includes the spirochaete Leptospira interrogans and the
gemmatimonadete Gemmatimonas aurantiaca, as well as Candidatus
Nitrospira defluvii which groups with the alpha-proteobacteria;
Candidatus Nitrospira defluvii also groups with the alpha-proteobacteria
in the ATPF0C analysis). A group containing the
actinobacteria and the planctomycetes (as well as the spirochaete
Leptospira interrogans and the gemmatimonadete Gemmatimonas
aurantiaca) is supported by the ATPF1G tree. Strong support is
provided by the ATPF0A phylogeny for a group containing the
verrucomicrobia and chloroflexi; the phylogenetic reconstruction
of ATPF1G (Figure S6) also has reasonable support for a group
containing the verrucomicrobia, chloroflexi, and the beta-gamma-proteobacteria.
Finally, reasonable support is provided in the
ATPF0A tree for the grouping of dictyoglomi and cyanobacteria,
and for a group containing the fusobacteria, firmicutes, tenericutes,
thermotogae, and beta-gamma-proteobacteria. In the
ATPF0C analysis, the dictyoglomi cluster with the N-ATPase with
good statistical support (Figure S2).

Figure 3. Phylogenetic reconstruction of ATPF1A. The tree shown is the best Bayesian topology, based on 215 sequences and 502 amino acid
positions (length after trimming; median sequence length before trimming: 508). Numerical values at the nodes of the tree (x/y/z) indicate statistical
support by MrBayes, PhyML and RAxML (posterior probability, bootstrap and bootstrap, respectively). Values for highly supported nodes have been
replaced by symbols, as indicated (MrBayes/PhyML/RAxML: >0.99/95%/95%; >0.95/80%/80%; >0.80/50%/50%). Species names are colour-coded
based on their bioenergetic mode as in Figure 1. Full details and accession
numbers for all protein sequences used are given in Table S1. The tree is rooted at the N-ATPase clade, previously reported to be the result of
horizontal gene transfer in a variety of species, all of which also contain a canonical ATPF0F1 (apart from the two Methanosarcina species shown,
which also have a canonical ATPV). The tree confidently separates the major bacterial taxonomic lineages, but with limited support for their
branching order: reasonable support is only provided for one subgroup containing the chlorobi and the bacteroidetes. The spirochaete Leptospira
interrogans groups with the planctomycetes. Two species-specific duplications (in Photobacterium profundum and Pelobacter carbinolicus) are
highlighted with a red ">". Two further duplications within the tenericutes are highlighted with a red "-" after the species name; this duplication
likely happened before the split between Mycoplasma agalactiae and Ureaplasma parvum.
doi:10.1371/journal.pcbi.1003821.g003
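The point about sequence length can be illustrated with the alignment sizes reported in the captions of Figures 2 and 3. The sketch below uses only those four numbers; the ratio of trimmed positions to median untrimmed length is a rough indicator chosen for illustration, not a statistic used in the original analysis.

```python
# Alignment sizes quoted in the captions of Figure 2 (ATPF0A) and Figure 3 (ATPF1A).
alignments = {
    "ATPF0A": {"trimmed_positions": 232, "median_length": 254},
    "ATPF1A": {"trimmed_positions": 502, "median_length": 508},
}

for name, aln in alignments.items():
    # Rough share of a typical sequence that survives trimming.
    retained = aln["trimmed_positions"] / aln["median_length"]
    print(name, aln["trimmed_positions"], round(retained, 3))

# How many times more aligned positions the longer subunit contributes.
length_ratio = (alignments["ATPF1A"]["trimmed_positions"]
                / alignments["ATPF0A"]["trimmed_positions"])
print(round(length_ratio, 2))  # 2.16
```

The ATPF1A alignment carries a little over twice as many positions as ATPF0A, which is the kind of difference the text invokes when comparing the resolution of trees from longer and shorter subunits.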

Although the phylogenetic analysis is based on trimmed
sequences, i.e. only the unambiguous homologous regions were
retained for phylogenetic analysis by manually inspecting and
masking/trimming the sequences, some notable insertions/deletions
were noted in the multiple alignments. For example, the
chlorobi and the bacteroidetes are both missing the C-terminal
half of ATPF1E, and share an internal 10-15 aa insertion in
ATPF1A. A different internal 10-15 aa insertion in ATPF1A is
shared between the beta- and gamma-proteobacteria. Actinobacteria
have a ~75 aa insertion near the N-terminus of ATPF1D,
and cyanobacteria have an internal 20 aa insertion in ATPF1G.
The N-ATPase ATPF1A in Azospirillum sp. B510 has a long
(~100 aa) N-terminal extension plus a ~150 aa insertion near the
N-terminus, while the N-ATPase ATPF1G in Cyanothece sp.
ATCC 51142 has a 50 aa N-terminal extension (Dataset S1). The
elucidation of the role of these signature sequences would require
further study based on experimental or structural analysis.

Genetic locus organization of the ATP synthase genes 

Given the ancient origin of the ATP synthase complex, the
syntenic genetic location of the genes was checked in all lineages,
to identify common gene order transversions, gene duplications,
and possible horizontal gene transfer events (Figure 4). The N-ATPase,
which has been suggested to be an early-diverging branch
of membrane ATPases [32], has the following gene order: 1B-1E-I-R-0A-0C-0B-1A-1G.
Bacteroides fragilis also has a similar gene
locus organization, except that it lacks atpR. The subunits are
arranged in consecutive order (i.e. the locus is not split) in the
dictyoglomi, planctomycetes, firmicutes, thermotogae, chloroflexi,
actinobacteria, tenericutes, verrucomicrobia, fusobacteria and the
beta- and gamma-proteobacteria. Except for the proteobacteria
and the verrucomicrobia, these lineages have been suggested to be
near the base of the bacterial clade, either based on phylogenetic
analysis [25,31] or based on the analysis of signature sequences
[26,30]. By inference, the most likely ancient gene order for the
ATPF0F1 locus is: I-sI-0A-0C-0B-1D-1A-1G-1B-1E, although some
lineages lack I or sI or both (e.g. fusobacteria, chloroflexi,
verrucomicrobia).

The locus has been split (indicated by semi-colons in Figure 4)
at the junction between 1G and 1B in the chlorobi, bacteroidetes,
cyanobacteria, aquificae and Beggiatoa, with further splits
between 1B and 1E in aquificae and Beggiatoa. A further split is
seen between 1D and 1A in the chlorobi and between 1A and 1G in
aquificae and Beggiatoa. A split between 0B and 1D is seen in
nitrospirae and the alpha-proteobacteria, while a split between 0C
and 0B is seen in aquificae, acidobacteria, deferribacteres, and
delta- and epsilon-proteobacteria. A split between 0A and 0C has
occurred in the epsilon-proteobacteria. Finally, a split between I
and 0A is seen in aquificae. Therefore, although there are three
"blocks" of genes which are usually conserved in terms of gene
order (one containing I(-sI)-0A-0C(-0B')-0B, another containing
1D-1A-1G, and another with 1B-1E), in principle, gene-order
transversion can and has happened all along the genetic locus.
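The split notation described above (hyphens between adjacent genes, semi-colons between separated parts of the locus) is straightforward to parse. The following sketch assumes that notation and the digit-based gene shorthand; the example locus string encodes only the two chlorobi splits named in the text and is illustrative, not a transcription of Figure 4.

```python
def split_junctions(locus):
    """Return the (left, right) gene pairs flanking each split in a locus string.

    Genes within a block are joined by '-', and blocks are separated by ';'.
    """
    blocks = [block.split("-") for block in locus.split(";") if block]
    return [(left[-1], right[0]) for left, right in zip(blocks, blocks[1:])]


# Illustrative locus with splits between 1D and 1A and between 1G and 1B,
# as described in the text for the chlorobi.
example_locus = "0A-0C-0B-1D;1A-1G;1B-1E"
print(split_junctions(example_locus))  # [('1D', '1A'), ('1G', '1B')]

# An unsplit locus, such as the inferred ancestral order, has no junctions.
print(split_junctions("I-sI-0A-0C-0B-1D-1A-1G-1B-1E"))  # []
```

Listing junctions per lineage in this way makes it easy to see which of the three usually conserved blocks remains intact in a given locus.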

Most commonly duplicated/lost genes 

The phylogenetic analysis and the gene locus information were
used to examine the most likely origin of duplicated genes, i.e.
whether they arose as gene duplications within a particular species,
or via horizontal gene transfer (Table 2). In the delta-proteobacterium
Pelobacter carbinolicus, there are two duplications of the
whole ATPF0F1 locus: one corresponds to the N-ATPase, and the
other is a species-specific duplication (Figures 2-3, S1, S2, S3, S4,
S5, S6). A duplicated full ATPF0F1 locus, which does not
correspond to the N-ATPase, was also found in the gamma-proteobacterium
Photobacterium profundum; this appears as a
species-specific duplication in the ATPF1A, ATPF1B and
ATPF1E trees (Figures 3, S3, S5), while in the rest of the trees,
one copy groups with Vibrio cholerae and the other elsewhere
within the gamma clade (Figures 2, S1, S2, S4, S6). This possibly
hints at HGT from another closely related species, but the
placement within the gamma clade is not consistent and could thus
simply be due to high sequence divergence of the second copy in
P. profundum for some of the subunits.

There are also certain in-locus gene duplications, where the 
duplicated genes are still found adjacent to each other on the 
genetic locus, as well as ectopic duplications (outside the main 
ATPF0F1 locus) probably resulting from 
recombinations/transversions (summarized in Table 2). The most commonly in-locus 
duplicated genes are ATPFOB and ATPI, discussed in more detail 
in the next section. The delta-proteobacterium Desulfococcus 
oleovorans has a fully duplicated ectopic ATPFO complement; in 
the ATPFOB phylogeny (Figure S1) both copies group within the 
delta-proteobacteria, suggesting that this could be a species-specific 
duplication where one copy has diversified. However, the 
duplicated ATPFOA (Figure 2) and ATPFOC (Figure S2) subunits 
group with the thermotogae with good bootstrap support, hinting 
at a possible HGT event; assuming a common origin for all three 
subunits in the duplicated locus, this suggestion of HGT from 
Thermotogae requires further study (phylogenetic analysis of only 
the deltaproteobacteria and thermotogae sequences did not 
resolve this issue as it gives the same results as above for the 
duplicated subunits, data not shown). The actinobacterium 

Figure 4. ATPF0F1 gene locus organization per lineage. The ATPF0F1 gene locus organization was checked for all species in the IMG database 
[47], and is summarized per lineage. The gene order shown follows the order in which the genes are transcribed in each genome (upstream to 
downstream). Semicolons indicate that the separated gene groups are on non-adjacent genetic locations (and can be very far upstream or 
downstream; e.g. separated by only 4 intervening ORFs in Geobacter sp. FRC-32, and by up to 5026 intervening ORFs, or 6 Mb, in Nostoc sp. PCC 7120; 
see Table S1). When the locus is split, the genes are shown in the order they are usually found in when the locus is intact. ATPFOB (K02109) is often 
duplicated, so one copy is called OB, and the other OB', based on the gene order. ATPI (K02116) is also often duplicated, and is designated "I", "sI" and 
"R" based on the presence of distinct pfam domains, as discussed in the text. Question marks indicate that the ATPI subunit is sometimes not clearly 
assigned to the orthology group. "X" denotes hypothetical intervening ORFs. Notable variations within some lineages are shown. *Especially for 
lineages represented by relatively few species, please see Table S1 for variations between the species examined within each lineage. 
doi:10.1371/journal.pcbi.1003821.g004 



Saccharopolyspora erythraea has a duplicated ectopic ATPFOA, 
which looks like a species-specific duplication (Figure 2). The 
zeta-proteobacterium Mariprofundus ferrooxydans has a duplicated 
ectopic ATPFOB; it is unclear if this is the result of HGT, as the 
sequence groups with planctomycetes, but not with high bootstrap 
support (Figure S1). The firmicute Alkaliphilus metalliredigens has 
a species-specific in-locus duplication of ATPFOC (Figure S2) 
which is characterized by a long (~100aa) N-terminal extension 
(Dataset S1). Ectopic duplications of ATPFIA and ATPFIB are 
seen in Mycoplasma agalactiae and Ureaplasma parvum 
(tenericutes), as has been reported recently [34]; this duplication likely 
happened before the split between the two species (Figures 3, S3); 
one of the ATPFIA copies in U. parvum has a long (~250aa) 
C-terminal extension (Dataset S1). Ureaplasma parvum also has 
duplicated ATPFID in-locus; the evolutionary history of this 
duplication cannot be clearly inferred from the phylogenetic 
analysis: it appears to be species-specific in the PhyML 
and RAxML trees, but this is not supported by high 
bootstrap values. ATPFIE is duplicated ectopically in 
Mariprofundus ferrooxydans (zeta-proteobacteria), Thiobacillus 
denitrificans (beta-proteobacteria), and the gamma-proteobacteria 
Acidithiobacillus caldus, Acidithiobacillus ferrivorans, and 
Acidithiobacillus ferrooxidans (two extra copies), as well as in-locus in 
the delta-proteobacteria Desulfovibrio magneticus and 
Desulfovibrio sp. FW1012B. The duplication in M. ferrooxydans is 
species-specific, while the other duplications are lineage-specific, i.e. the 
duplication either occurred before the split from other 
closely-related species or represents HGT from other closely-related 
species (Figure S5): the duplication in T. denitrificans may 
represent HGT from other gamma-proteobacteria; a duplication 
occurred before the split between the gamma-proteobacteria 
Acidithiobacillus ferrooxidans, Acidithiobacillus ferrivorans and 
Acidithiobacillus caldus, with a further species-specific duplication 
in Acidithiobacillus ferrooxidans; another duplication occurred 
before the split between Desulfovibrio magneticus and 
Desulfovibrio sp. FW1012B in the delta-proteobacteria. ATPFIG is 
ectopically duplicated in Aquifex aeolicus (aquificae; one copy 
has an 80aa C-terminal extension) and Acidithiobacillus 
ferrooxidans (gamma-proteobacteria; one copy is missing the N-terminal 
half); the duplication in Aquifex aeolicus represents a very 
divergent sequence which groups with the dictyoglomi in the 
MrBayes and PhyML trees (Figure S6); the duplication in A. 
ferrooxidans might be a pseudogene as it is much smaller in size; 
in the tree it clusters with the N-ATPase genes. 

ATPFOB is duplicated in-locus in acidobacteria, aquificae, 
cyanobacteria, deferribacteres, and alpha-, delta- and 
epsilon-proteobacteria. This raises the question of whether ATPFOB has 
been duplicated independently in separate lineages, or whether the 
duplication has been passed on, either by direct descent, or by 
horizontal gene transfer. In the phylogenetic analysis (Figure S1) 
the ATPFOB' group in the alpha-proteobacteria appears as a sister 
group to the alpha-proteobacterial ATPFOB, but with only 
moderate statistical support (red asterisk: 0.7 posterior probability 
in MrBayes, 50% and 46% bootstrap support in PhyML and 
RAxML, respectively). The other ATPFOB's group together (blue 
asterisk), with good statistical support by MrBayes (posterior 
probability: 1) but with low support in PhyML and RAxML (24% 
and 26% bootstrap support, respectively). The grouping of the 
alpha-proteobacterial ATPFOB and ATPFOB' may indicate that 
this duplication happened more recently than the ATPFOB 
duplications in the other lineages. However, given the low 
bootstrap support it remains unclear from the tree whether the 
ATPFOB/OB' duplication happened independently in the different 
lineages where it is observed, or whether it happened only once in 
the common ancestor of all the lineages where it is observed (and 
was presumably lost in other lineages, e.g. the 
beta-gamma-proteobacteria); however, the latter scenario is more plausible based on 
parsimony considerations. 

Notable absences are the ATPFID in N-ATPase, as well as in 
dictyoglomi (Dictyoglomus thermophilum and Dictyoglomus 
turgidum), ATPFOC in Wolinella succinogenes (epsilon-proteobacteria), 
ATPFIB and ATPFIE in the cyanobacterium Microcoleus 
chthonoplastes, and ATPI missing from many species (e.g. 
chloroflexales, verrucomicrobia). At least some of these absences 
may of course be due to incomplete annotation or extreme 
sequence divergence. 

Evolution of atpI, sI, and R 

ATPI has been the least studied subunit of the F0F1 ATP 
synthase complex. As mentioned above, ATPI (K02116) is 
interchangeably associated with two pfam domains 
(pfam03899-ATP_synthI and pfam09527-ATPase_gene1), which makes 
orthologous gene assignments problematic. The bacterial uncI gene, 
encoding a small transmembrane protein which includes the 
pfam03899 domain, has been demonstrated to have a chaperone 
role in assisting the assembly of the c-ring of the F0 subcomplex 
[35,36]. By analogy, it has been suggested that the atpR gene of 
the N-ATPase (characterized by the presence of the pfam12966 
domain) plays a similar role, in the absence of uncI [32]. Given 
this suggestion, and the grouping of the atpQ genes (which include 
the pfam09527 domain) into the same KEGG cluster as uncI, 
along with the fact that all three encode proteins of similar size 
and, based on their position in the genetic cluster, could be the 
result of gene duplications, we decided to analyze their 
evolutionary relationship in more detail. 

The phylogenetic reconstruction of ATPI (K02116) protein 
sequences, including "sI" proteins containing the 
pfam03899-ATP_synthI domain, "I" proteins containing the 
pfam09527-ATPase_gene1 domain, and "R" proteins containing the pfam12966-atpR 
domain (found in the N-ATPase locus) is shown in Figure S7. 
Overall the three types of proteins look similar in the alignment, 
although atpR stands out, as do the cyanobacterial sI sequences; 
the delta-proteobacterium Desulfovibrio piger has a prominent 
50aa C-terminal extension (Dataset S1). Only the PhyML tree is 
shown, even though the bootstrap support for most branches is not 
significant. Phylogenetic analysis with the same set of sequences 
using MrBayes failed to converge on a tree, and the RAxML tree 
had very poor resolution. The low resolution and low bootstrap 
support are probably due to the short sequence length and high 
divergence of these sequences. Nevertheless, the tree does separate 
a cluster of the "I" proteins (which contain the 
pfam09527-ATPase_gene1 domain) to the left of the dotted grey line, and 
another cluster containing the "sI" proteins 
(pfam03899-ATP_synthI domain) and "R" proteins (pfam12966-atpR domain) to the 
right of the dotted grey line. Based on the gene locus organization 
and the protein sizes, the genes encoding the "sI" and "R" 
proteins look like duplications of the "I" gene, and the tree indeed 
supports this hypothesis. However, due to the low resolution of the 
phylogenetic analysis, the issue of the origin and functional 
homology of atpI, sI, and R would ultimately need to be resolved 
with structural and functional analysis. 

Discussion 

Phylogenetic analysis of 16S rRNA for 272 species chosen to 
represent all the major prokaryotic lineages and bioenergetic 
modes indicated that, overall, there is no monophyly of 
bioenergetic modes (one notable exception is oxygenic photosyn- 
thesis which is confined to the cyanobacteria). This analysis also 
highlighted lineages which include species with vastly different 
modes of generating energy (e.g. proteobacteria, firmicutes). The 
scattered distribution of certain bioenergetic modes, such as 
anoxygenic photosynthesis or iron oxidation, indicates rampant 
HGT of at least some bioenergetic modes, in agreement with 
previous analyses [16,17,18]. All these bioenergetic pathways also 
include the ATP synthase complex, but phylogenetic analysis of all 
the ATPF0F1 synthase subunits, common to almost all bacterial 
lineages, largely agrees with the 16S rRNA tree. This indicates that, 
if different bioenergetic pathways dispersed into different lineages 
by horizontal gene transfer, this did not involve the ATP synthase 
complex. Presumably, each species used its pre-existing ATP 
synthase complex and adapted it to utilize the proton gradient 
generated by vastly different ETCs. Recent data has shown that 
large-scale HGT from bacteria transformed the bioenergetic 
capabilities of the Haloarchaea [37] and yet Haloarchaea retain 
ATPV, whereas their laterally acquired bioenergetics modes utilize 
ATPF in the bacteria. This is in agreement with our results, and 
again indicates flexibility in combining a species' pre-existing ATP 
synthase with a newly acquired electron transport chain. Given the 
widespread effect of HGT on prokaryotic evolution [38,39,40], it 
may be that the cost of incorporating a laterally transferred ATP 
synthase to replace a pre-existing enzyme is too high to overcome 
[41]. To our knowledge the question of whether specific 
modifications are needed for the ATP synthase to function with 
different bioenergetic modes has not been addressed previously, so 
this current, updated large-scope study allows us to resolve this 
issue, and suggests that no apparent such modifications exist, at 
least at the sequence level. A more thorough structural analysis 
would be needed to examine if certain structural modifications 
unite the ATP synthases of organisms using each bioenergetic 
pathway. 

HGT has happened, however, for a variant form of the ATP 
synthase, previously named N-ATPase, as it includes residues in 
the c subunit for translocating Na+ [32]. This is found always in 
addition to the F0F1 ATP synthase, in certain species from 
different bacterial lineages, as well as in two Methanosarcina 
species of the archaea. The N-ATPase subunits always cluster 
independently of their F0F1 counterparts, and although they often 
group closest to the dictyoglomi, only the ATPFOC phylogeny has 
significant bootstrap support for a grouping of the dictyoglomi and 
N-ATPase; therefore, their exact origin cannot be inferred from 
the tree, and possibly predates the separation between ATPV and 
ATPF [32]. The N-ATPase locus is characterized by the absence 
of the ATPFID subunit, and the presence of the atpR gene (also 
see below). Interestingly, the two dictyoglomi species studied here 
(the only two for which complete genome information is available) 
also lack the ATPFID subunit, which in combination with the 
close affinity of the dictyoglomi and the N-ATPase in most of the 
trees, might suggest that the dictyoglomi are the closest relative to 
the common ancestor of the N-ATPase. In the 
gamma-proteobacterium Nitrosococcus halophilus, two copies of the 
N-ATPase are found (one locus is split in half, both are missing atpR), 
whereas Chlorobaculum tepidum of the chlorobi only has half the 
locus; the lack of certain subunits may indicate a non-functional 
degenerate N-ATPase. It is assumed that the N-ATPase confers a 
selective advantage in high-salt environments [32]. 

Given the ancient origin of the F0F1 ATPase, the phylogenetic 
trees can perhaps give clues as to the evolutionary relationships 
between different bacterial lineages. The branching order of 
bacterial lineages remains an issue unresolved through 
phylogenetic analysis [25,27,28,29], although other methods have also 
been proposed based on signature sequences of certain crucial 
proteins [26], and a more recent analysis based on feature 
frequency profiles in whole proteome data has produced a 
well-resolved tree [30]. Some of the F0F1 ATP synthase subunits are 
relatively long proteins, and relatively slow evolving due to their 
interactions with the other subunits, so they may retain some of the 
evolutionary signal that cannot be retrieved from 16S rRNA 
sequences. There is consistent support for a grouping of the beta- 
and gamma-proteobacteria, another of the chlorobi and the 
bacteroidetes, and some support for this group also including the 
planctomycetes, the actinobacteria, the alpha-proteobacteria and 
the spirochaete Leptospira interrogans and the gemmatimonadete 
Gemmatimonas aurantiaca; Candidatus Nitrospira defluvii groups 
with the alpha-proteobacteria. Some trees also indicate a subgroup 
containing the verrucomicrobia and the chloroflexi, and possibly 
also the beta-gamma-proteobacteria. Finally, reasonable support is 
provided in the ATPFOA tree for the grouping of dictyoglomi and 
cyanobacteria, and for a subgroup containing the fusobacteria, 
tenericutes, firmicutes, thermotogae, and beta-gamma-proteobac- 
teria. The groupings of (i) the beta-gamma proteobacteria, (ii) the 
chlorobi and bacteroidetes, and (iii) the fusobacteria, tenericutes, 
firmicutes, and thermotogae, are in agreement with the more 
recent phylogeny [30]. 

The order of the genes encoding the F0F1 ATP synthase 
subunits is relatively well conserved overall in most of the species 
analyzed, although the locus has been split on multiple occasions, 
and the genes for ATPFIB and ATPFIE are found either 
upstream (in the N-ATPase and in Bacteroides fragilis) or, most 
commonly, downstream of all the others. Duplications of each of 
the F0F1 ATP synthase subunits are observed in several species, 
either within the genetic locus or in distant parts of the genome. 
The history of these duplications can be traced by looking at the 
phylogenetic analysis. The most ancient in-locus duplication is 
likely that of atpI, with the diversification of the downstream copy 
into "sI" and "R", but with multiple losses in various lineages of 
either one or both copies. Another ancient in-locus duplication is 
that of ATPFOB, which probably occurred in the common 
ancestor of the acidobacteria, aquificae, cyanobacteria, 
deferribacteres, delta-, epsilon- and alpha-proteobacteria (and was 
presumably lost in other lineages, e.g. the beta-gamma-proteobacteria). 
Most of the other duplications have occurred in isolated species, 
and appear to be species-specific, with no unassailable evidence of 
HGT. 

These duplications raise the question of how certain species deal 
with gene dosage effects, e.g. to co-ordinate the ATP synthase 
complex structure. As there is no clear evidence of HGT, apart 
from the N-ATPase clade, most duplications seem to be the result 
of stochastic events that have not been bred out; presumably this 
means that at least some of these duplications, e.g. the 
ATPFOB/OB' duplication, may confer a selective advantage, although this 
would need to be confirmed experimentally. A recent study of the 
ATPV complex showed that such paralogous expansions can lead 
to increased complexity (and possibly also specificity) of a 
multi-subunit molecular machine [42]. Moreover, ATPFOB functions as 
a dimer, even in species where only one copy exists in the genome, 
and the two parts of the dimer interact with different parts of the 
F1 and the F0 subcomplex [43,44]. Thus a gene duplication which 
allows each gene copy to fine-tune specific interactions may indeed 
be advantageous. Notably, cyanobacterial ATPFOB/OB' have 
been successfully inserted into a null E. coli strain (which lacks its 
native single ATPFOB) and form heterodimers which assemble 
with the rest of the native E. coli ATP synthase subunits to form a 
functional enzyme [45]. This again points to a flexibility of the 
ATP synthase in different species, to accommodate changes and 
duplications. 

The loss of ATPFID from N-ATPase and dictyoglomi, as well 
as ATPI from many species, also raises the question of the 
essentiality of these subunits for the function of the F0F1 ATP 
synthase. The absence of certain subunits in isolated species 
(ATPFOC from Wolinella succinogenes (epsilon-proteobacteria), 
ATPFIB and ATPFIE from the cyanobacterium Microcoleus 
chthonoplastes) may be due to incomplete annotation or extreme 
sequence divergence, although if they represent true losses, again 
this raises questions as to the functionality of the ATP synthase in 
these species. 

Overall, this analysis highlights the patchy distribution of 
bioenergetic modes across prokaryotic lineages, which is most 
likely the result of HGT. However, there is no evidence of HGT 
for the ATP synthase to accompany the spread of bioenergetic 
pathways in different lineages. This means that the ATP synthase 
cannot be used to reconstruct the origin of the diversity of 
bioenergetic modes in prokaryotes. It also indicates that there are 
no apparent specific modifications of the F0F1 ATP synthase in 
order for it to work with different bioenergetic ETCs. The F0F1 
ATP synthase genetic locus is overall well conserved, although as 
demonstrated by multiple splits and duplications, in principle, the 
system is robust and flexible, as it can deal with a split between any 
subunits and/or a duplication of any subunit. The elucidation of 
the way in which certain species deal with these duplications, splits 
and losses, and the advantage any of these may confer, now 
requires further study. 

Materials and Methods 

Organism selection 

Bacteria and archaea species whose genomes have been 
completely sequenced and are available at NCBI were chosen 
by parsing the NCBI Genome Project database 
(http://www.ncbi.nlm.nih.gov/bioproject) with keywords relating to the relevant 
metabolisms (e.g. "anoxygenic phototroph"), and the relevant 
phyla (e.g. "chlorobi"). For autotrophs and chemolithotrophs, all 
relevant species were examined, but for heterotrophs, only a 
sample of species was examined so as to cover the full diversity of 
bacteria and archaea [31] (http://tolweb.org/tree/) and the full 
bioenergetic diversity per lineage. For lineages with many 
sequenced genomes, the tree of [31] was used to pick species so 
as to cover as much phylogenetic diversity as possible with a 
limited number of species. The set of species selected represents 
131 clusters, with a genome similarity score (GSS) threshold of 0.5; 
of those, 24 are in "clusters" which only have one member, and 63 
are the sole representatives from their cluster [46]. Information on 
the metabolic mode of all species was also cross-checked in the 
IMG database [47]. Each species name was assigned an 
8-character abbreviation for better data handling during the 
phylogenetic analysis, by keeping the first two letters of the first 
name and the first three letters of the second name, as well as a 
2-3 letter ending, denoting the bioenergetic mode. Details of all the 
272 organisms analyzed, and of the species name abbreviations, 
are given in Table S1. 
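
The abbreviation rule described above can be sketched in a few lines of Python. This is only an illustration, not the script used in the study: the function name `abbreviate` is hypothetical, the underscore separator is inferred from labels such as Pe_carFR, and the mode codes are treated as opaque 2-3 letter strings.

```python
def abbreviate(genus: str, species: str, mode: str) -> str:
    """Build a label such as 'Pe_carFR' from a binomial name and a mode code.

    Assumptions (not stated in the text): an underscore separates the two
    name fragments, and the original capitalization of the names is kept.
    """
    if len(genus) < 2 or len(species) < 3:
        raise ValueError("name too short to abbreviate")
    if not 2 <= len(mode) <= 3:
        raise ValueError("mode code must be 2-3 letters")
    # First two letters of the first name, first three of the second name,
    # followed by the bioenergetic mode ending.
    return f"{genus[:2]}_{species[:3]}{mode}"


if __name__ == "__main__":
    print(abbreviate("Pelobacter", "carbinolicus", "FR"))
    print(abbreviate("Aquifex", "aeolicus", "He"))
```

Strain-level labels that do not follow a plain binomial name (for example Candidatus species or unnamed strains) would need special handling that this sketch does not attempt.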



Sequence retrieval and phylogenetic analysis 

16S rRNA sequences were downloaded pre-aligned from the 
RDP database [48]. When more than one sequence was available 
for each species/strain examined, one of the good-quality 
>1200 bp sequences was selected at random, unless the type 
sequence was available, in which case that was selected. 
Importantly, we used data from the same strain for the 16S 
analysis and the ATP analysis (see below). As bacterial and 
archaeal sequences are provided as separate pre-aligned files, the 
program opal was used to align the two sets [49]. Common gaps 
were removed after manual examination of the whole set of 
sequences in MacClade. The nucleotide substitution model that 
best fits the data (GTR+I+G) was selected using the program 
ModelGenerator [50] (http://bioinf.nuim.ie/modelgenerator/). 
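
As an illustration of two steps in this paragraph, the sketch below applies the stated selection rule (take the type sequence if present, otherwise a random good-quality sequence longer than 1200 bp) and removes alignment columns that are gaps in every sequence. It is a minimal sketch under assumptions: the record fields (`id`, `length`, `good_quality`, `is_type`) and both function names are hypothetical, "common gaps" is read as all-gap columns, and in the study the gap removal followed manual inspection rather than an automated pass.

```python
import random

GAP_CHARS = {"-", "."}  # assumption: both symbols mark alignment gaps


def pick_16s(candidates, rng=random):
    """Pick one 16S record for a strain.

    Each candidate is a dict with hypothetical keys:
    'id', 'length' (bp), 'good_quality' (bool), 'is_type' (bool).
    """
    type_records = [c for c in candidates if c["is_type"]]
    if type_records:
        return type_records[0]  # the type sequence takes priority
    eligible = [c for c in candidates
                if c["good_quality"] and c["length"] > 1200]
    return rng.choice(eligible) if eligible else None


def drop_common_gaps(alignment):
    """Remove columns that contain a gap in every aligned sequence.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    rows = list(alignment.values())
    if not rows:
        return {}
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("aligned sequences must have equal length")
    keep = [i for i in range(width)
            if any(r[i] not in GAP_CHARS for r in rows)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    print(drop_common_gaps({"bact": "AC--G", "arch": "A-.-G"}))
```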

All other analyses were done at the amino acid level. For the 
ATP synthase subunits, sequence accession numbers were 
retrieved using the ortholog tables from the KEGG database: 
KEGG ortholog tables are based on RefSeq annotations, sequence 
similarity and best-hit searches, as well as tools for operon-like 
consistency and completeness of pathway modules and complexes; 
furthermore they are regularly updated 
(http://www.kegg.jp/kegg/ko.html). In cases where data were missing from the KEGG 
database, this was supplemented by data from IMG [47], manual 
analysis to find the best reciprocal BLAST hits, as well as synteny 
considerations, since the gene order of the ATP synthase locus is 
well-conserved overall. The accession numbers of all sequences 
analyzed, and the corresponding species name abbreviations, are 
given in Table S1. Sequences were downloaded from KEGG in 
fasta format using a custom perl script. Alignments were created 
using MUSCLE [51]. Only unambiguous homologous regions 
were retained for phylogenetic analysis by manually inspecting and 
masking/trimming the sequences in MacClade (the masked 
alignments are given in Dataset S1). ProtTest [52] was used to 
estimate the appropriate model of sequence evolution. 
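
The reciprocal best BLAST hit criterion mentioned above can be expressed compactly once the best hit in each search direction is known. The sketch below is an assumption-laden illustration: it takes two precomputed dictionaries of best hits rather than running BLAST, and the function name and gene identifiers are hypothetical. In the study this step was done manually and combined with synteny.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit.

    best_a_to_b maps each query gene in genome A to its best hit in genome B;
    best_b_to_a is the same for the reverse search.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    forward = {"geneA1": "geneB7", "geneA2": "geneB9"}
    reverse = {"geneB7": "geneA1", "geneB9": "geneA5"}
    print(reciprocal_best_hits(forward, reverse))
```

Only the first pair survives here, because the best hit of geneB9 in the reverse search is a different gene from the one that hit it.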

Phylogenetic analysis was performed by three separate methods. 
To obtain the Bayesian tree topology and posterior probability 
values, the program MrBayes version 3.1.2 was used [53]. 
Analyses were run for 1-5 million generations, removing all trees 
before a plateau established by graphical estimation. All 
calculations were checked for convergence and had a splits frequency of 
<0.1. Maximum-likelihood (ML) analysis was performed using 
PhyML [54] and RAxML [55] with 100 bootstrap replicates. 
Nodes with better than 0.95 posterior probability and 80% 
bootstrap support were considered robust, and nodes with better 
than 0.80 posterior probability and 50% bootstrap support are 
shown. Tree files were processed in FigTree v1.4 and Adobe 
Illustrator to highlight homologous groups, and colour-code 
species names based on bioenergetic mode. 
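
The support thresholds given in this paragraph can be written as a small classifier. This is a sketch under one explicit assumption: because each node carries two bootstrap values (PhyML and RAxML), the lower of the two is compared against the threshold; the text does not say how the two values were combined. The function name and the example values are hypothetical.

```python
def classify_node(posterior: float, phyml_bs: float, raxml_bs: float) -> str:
    """Classify a node as 'robust', 'shown', or 'unsupported'.

    Thresholds follow the text ("better than" is read as strictly greater):
    robust: posterior > 0.95 and bootstrap > 80
    shown:  posterior > 0.80 and bootstrap > 50
    Assumption: both ML bootstrap values must pass, so the minimum is used.
    """
    bootstrap = min(phyml_bs, raxml_bs)
    if posterior > 0.95 and bootstrap > 80:
        return "robust"
    if posterior > 0.80 and bootstrap > 50:
        return "shown"
    return "unsupported"


if __name__ == "__main__":
    # Hypothetical support values (posterior, PhyML %, RAxML %).
    for values in [(0.99, 95, 90), (0.90, 60, 70), (0.85, 40, 75)]:
        print(values, classify_node(*values))
```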

Genetic locus analysis 

As the genes encoding the different subunits of the ATP 
synthase are normally clustered in an operon, the genetic locus of 
the sequences analyzed was examined in the IMG database [47]. 
Details of the locus organization in each species are given in 
Table S1 and the data is summarized per lineage in Figure 4. 
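
To make the locus notation of Figure 4 concrete, the sketch below turns a list of subunit labels with their ORF positions into a string in which hyphens join adjacent genes and semicolons separate non-adjacent groups. It is a simplified illustration, not the procedure used in the study: the function name, the ORF-index input, and the `max_gap` parameter are assumptions, intervening hypothetical ORFs ("X") are not rendered, and genes are listed in genomic order rather than in the order of the intact locus.

```python
def locus_string(genes, max_gap=0):
    """Summarize a gene locus as e.g. 'OA-OC-OB;IB-IE'.

    `genes` is a list of (subunit_label, orf_index) tuples, where orf_index
    is the position of the ORF along the genome (hypothetical input format).
    Genes separated by more than `max_gap` intervening ORFs start a new
    semicolon-separated group.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    blocks = []
    previous_index = None
    for label, index in ordered:
        if previous_index is not None and index - previous_index - 1 <= max_gap:
            blocks[-1].append(label)  # adjacent to the previous gene
        else:
            blocks.append([label])    # start a new, non-adjacent group
        previous_index = index
    return ";".join("-".join(block) for block in blocks)


if __name__ == "__main__":
    # Hypothetical positions for a split locus.
    example = [("IB", 5040), ("OA", 10), ("OC", 11), ("OB", 12), ("IE", 5041)]
    print(locus_string(example))
```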

Supporting information 

Figure S1 Phylogenetic reconstruction of ATPFOB. The 
tree shown is the best Bayesian topology, based on 298 sequences 
and 161 amino acid positions (length after trimming; median 
sequence length before trimming: 170). Numerical values at the 
nodes of the tree (x/y/z) indicate statistical support by MrBayes, 
PhyML and RAxML (posterior probability, bootstrap and 
bootstrap, respectively). Values for highly supported nodes have 
been replaced by symbols, as indicated. Species names are 
colour-coded based on their bioenergetic mode. Full details and accession 
numbers for all protein sequences used are given in Table S1. The 
tree is rooted at the N-ATPase clade, previously reported to be the 
result of horizontal gene transfer in a variety of species, all of 
which also contain a canonical ATPF0F1 (apart from the two 
Methanosarcina species shown which also have a canonical 
ATPV). The tree confidently separates the major bacterial 
taxonomic lineages, but with no clear support for their branching 
order; notably however, the dictyoglomi cluster with the N-ATPase 
with good statistical support. There are multiple duplications, most 
of which represent the in-locus duplication of ATPFOB/OB' seen 
in acidobacteria, aquificae, cyanobacteria, deferribacteres, delta-, 
epsilon- and alpha-proteobacteria. The ATPFOB' group in the 
alpha-proteobacteria appears as a sister group to the 
alpha-proteobacterial ATPFOB, although with only moderate statistical 
support (red asterisk: 0.7 posterior probability in MrBayes, 50% 
and 46% bootstrap support in PhyML and RAxML, respectively). 
The other ATPFOB's group together (blue asterisk), with good 
statistical support by MrBayes (posterior probability: 1) but with 
low support by PhyML and RAxML (24% and 26% bootstrap 
support, respectively). It is thus unclear from the tree whether the 
ATPFOB/OB' duplication happened independently in the different 
lineages where it is observed, or whether it happened only once in 
the common ancestor of all the lineages where it is observed (and 
was presumably lost in other lineages, e.g. the 
beta-gamma-proteobacteria); however, the latter scenario is more plausible based on 
parsimony considerations. Two species-specific duplications in 
Pelobacter carbinolicus are highlighted with a red ">". Four more 
duplications are highlighted with a red "-" after the species 
name: in Photobacterium profundum the duplication either 
occurred before the split from other closely-related species or 
represents HGT from other gamma-proteobacteria; in 
Desulfococcus oleovorans the duplication (which is a duplication of the 
full ATPFO locus) seems to be species-specific, contrary to what 
is seen for ATPFOA in Figure 2 and ATPFOC in Figure S2; it is 
unclear if the duplication in the zeta-proteobacterium 
Mariprofundus ferrooxydans is the result of HGT, as the sequence 
groups with planctomycetes, but not with high bootstrap 
support. The duplication in Methylacidiphilum infernorum 
(highlighted with a yellow "-" after the species name) represents 
the ATPFOB within the N-ATPase locus, but it did not group 
with the other N-ATPase ATPFOBs, probably due to its long 
branch length. 
(EPS) 

Figure S2 Phylogenetic reconstruction of ATPFOC. The 
tree shown is the best Bayesian topology, based on 214 sequences 
and 77 amino acid positions (length after trimming; median 
sequence length before trimming: 81). Numerical values at the 
nodes of the tree (x/y/z) indicate statistical support by MrBayes, 
PhyML and RAxML (posterior probability, bootstrap and 
bootstrap, respectively). Values for highly supported nodes have 
been replaced by symbols, as indicated. Species names are 
colour-coded based on their bioenergetic mode. Full details and accession 
numbers for all protein sequences used are given in Table S1. The 
tree is rooted at the N-ATPase clade, previously reported to be the 
result of horizontal gene transfer in a variety of species, all of 
which also contain a canonical ATPF0F1 (apart from the two 
Methanosarcina species shown which also have a canonical 
ATPV). The tree confidently separates the major bacterial 
taxonomic lineages, but with limited support for their branching 
order: reasonable support is only provided for one subgroup 
containing the chlorobi, bacteroidetes and planctomycetes 
(as well as the spirochaete Leptospira interrogans and the 
gemmatimonadete Gemmatimonas aurantiaca). Candidatus 
Nitrospira defluvii groups with the alpha-proteobacteria. Two 
species-specific duplications (in Pelobacter carbinolicus and 
Alkaliphilus metalliredigens) are highlighted with a red ">". 
Two further duplications are highlighted with a red "-" after the 
species name; in Photobacterium profundum the duplication either 
occurred before the split from other closely-related species or 
represents HGT from other gamma-proteobacteria; the 
duplication in Desulfococcus oleovorans possibly represents HGT from 
thermotogae (also see Figure 2). 
(EPS) 

Figure S3 Phylogenetic reconstruction of ATPFIB. The 

tree shown is the best Bayesian topology, based on 2 1 5 serjuences 
and 458 amino acid positions (length after trimming; median 
sequence length before trimming: 470). Numerical values at the 
nodes of the tree (x/ y/z) indicate statistical support by MrBayes, 
PhyML and RAxML (posterior probability, bootstrap and 
bootstrap, respectively). Values for highly supported nodes have 
been replaced by symbols, as indicated. Species names are colour- 
coded based on their bioenergetic mode. Full details and accession 
numbers for all protein sequences used are given in Table S1. The 
tree is rooted at the N-ATPase clade, previously reported to be the 
result of horizontal gene transfer in a variety of species, all of 
which also contain a canonical ATPFoF1 (apart from the two 
Methanosarcina species shown which also have a canonical 
ATPV). The tree confidently separates the major bacterial 
taxonomic lineages, but with limited support for their branching 
order: reasonable support is only provided for one subgroup 
containing the chlorobi and the bacteroidetes. Two species-specific 
duplications (in Photobacterium profundum and Pelobacter 
carbinolicus) are highlighted with a red ">". Two further 
duplications within the tenericutes are highlighted with a red "-" 
after the species name; this duplication likely happened before the 
split between Mycoplasma agalactiae and Ureaplasma parvum. 
(EPS) 

Figure S4 Phylogenetic reconstruction of ATPF1D. The 

tree shown is the best Bayesian topology, based on 189 sequences 
and 180 amino acid positions (length after trimming; median 
sequence length before trimming: 181). Numerical values at the 
nodes of the tree (x/y/z) indicate statistical support by MrBayes, 
PhyML and RAxML (posterior probability, bootstrap and 
bootstrap, respectively). Values for highly supported nodes have 
been replaced by symbols, as indicated. Species names are colour- 
coded based on their bioenergetic mode. Full details and accession 
numbers for all protein sequences used are given in Table S1. The 
tree is rooted at Thermotogae, which is generally accepted as being 
one of the ancestral lineages of the bacteria (N-ATPase has no 
ATPF1D). The tree confidently separates the major bacterial 
taxonomic lineages, but with no clear support for their branching 
order. One species-specific duplication is highlighted with a red 
">" in Pelobacter carbinolicus. Two further duplications are 
highlighted with a red "-" after the species name: in Photobacter- 
ium profundum the (lineage-specific) duplication either occurred 
before the split from other closely-related species or represents 
HGT from other gamma-proteobacteria; the evolutionary history 
of the duplication in Ureaplasma parvum cannot be clearly 
inferred from the phylogenetic analysis; it appears to be species- 
specific in the PhyML and RAxML trees, but is not statistically 
supported by high bootstrap values. 
(EPS) 

Figure S5 Phylogenetic reconstruction of ATPF1E. The 

tree shown is the best Bayesian topology, based on 221 sequences 
and 138 amino acid positions (length after trimming; median 
sequence length before trimming: 137). Numerical values at the 
nodes of the tree (x/y/z) indicate statistical support by MrBayes, 
PhyML and RAxML (posterior probability, bootstrap and 
bootstrap, respectively). Values for highly supported nodes have 
been replaced by symbols, as indicated. Species names are colour- 
coded based on their bioenergetic mode. Full details and accession 
numbers for all protein sequences used are given in Table S1. The 
tree is rooted at the N-ATPase clade, previously reported to be the 
result of horizontal gene transfer in a variety of species, all of 
which also contain a canonical ATPFoF1 (apart from the two 
Methanosarcina species shown which also have a canonical 
ATPV). The tree confidently separates the major bacterial 
taxonomic lineages, but with no clear support for their branching 
order. Three species-specific duplications are highlighted with a 
red ">" in Pelobacter carbinolicus, Photobacterium profundum, 
and Mariprofundus ferrooxydans. Lineage-specific duplications 
(the duplication either occurred before the split from other closely- 
related species or represents HGT from other closely-related 
species) are highlighted with a red "-" after the species name: one 
duplication seems to have occurred before the split between 
Desulfovibrio magneticus and Desulfovibrio sp. FW1012B in the 
delta-proteobacteria; another duplication is seen in the beta- 
proteobacterium Thiobacillus denitrificans (which may represent 
HGT from other gamma-proteobacteria); finally a duplication 
occurred before the split between the gamma-proteobacteria 
Acidithiobacillus ferrooxidans, Acidithiobacillus ferrivorans and 
Acidithiobacillus caldus, with a further species-specific duplication 
in Acidithiobacillus ferrooxidans. 
(EPS) 

Figure S6 Phylogenetic reconstruction of ATPF1G. The 

tree shown is the best Bayesian topology, based on 215 sequences 
and 291 amino acid positions (length after trimming; median 
sequence length before trimming: 291). Numerical values at the 
nodes of the tree (x/y/z) indicate statistical support by MrBayes, 
PhyML and RAxML (posterior probability, bootstrap and 
bootstrap, respectively). Values for highly supported nodes have 
been replaced by symbols, as indicated. Species names are 
colour-coded based on their bioenergetic mode. Full details and 
accession numbers for all protein sequences used are given in 
Table S1. The tree is rooted at the N-ATPase clade, previously 
reported to be the result of horizontal gene transfer in a variety of 
species, all of which also contain a canonical ATPFoF1 (apart 
from the two Methanosarcina species shown which also have a 
canonical ATPV). The tree confidently separates the major 
bacterial taxonomic lineages, but with limited support for their 
branching order: reasonable support is only provided for one 
subgroup containing the chloroflexi, beta-gamma-proteobacteria 
and the verrucomicrobia, and another subgroup containing the 
actinobacteria and the planctomycetes (as well as the spirochaete 
Leptospira interrogans and the gemmatimonadete Gemmatimonas 
aurantiaca). One species-specific duplication in Pelobacter 
carbinolicus is highlighted with a red ">". Two further 
duplications are highlighted with a red "-" after the species 
name; in Photobacterium profundum the duplication either 
occurred before the split from other closely-related species or 
represents HGT from other gamma-proteobacteria; the duplica- 
tion in Aquifex aeolicus represents a very divergent sequence 